Likelihood

Author
Affiliation

Dave Clark

Binghamton University

Published

August 25, 2024

Motivating MLE


ML, OLS?

  • Why do we turn to maximum likelihood instead of OLS, especially given the simplicity and robustness of OLS?

  • Our data can’t meet the OLS assumptions; \(y\) is often limited* such that we observe \(y\) but really want to measure \(y^*\).

  • \(y^*\) is often a continuous (unlimited) variable we wish we could measure - if we could, we’d use OLS to estimate its correlates.

  • OLS asks us to make the data satisfy the model. MLE asks us to build a model based on the data. Since our data often don’t satisfy the OLS assumptions, MLE offers a flexible alternative.


ML

ML asks us to think of the data as given and to imagine the model that might best represent the Data Generating Process.

  • In OLS, the model is fixed - the assumptions about \(\epsilon\) are given.
  • In ML, the data are fixed - we construct a model that reflects the nature of the data.
  • What we assume about the unobservables is informed by what we know about the observables, the \(y\) variable.
  • A useful way to describe \(y\) is to characterize its observed or empirical distribution.
  • Once we know the distribution of \(y\), we can begin building a model based on that distribution.

Limited Dependent Variables

We are limited in what we can observe and/or measure in \(y\). E.g., we observe an individual voting or not; we cannot observe the chances an individual votes.

  • The \(y\) variable has both observed and latent qualities; label the latent variable \(\widetilde{y}\).

  • Often, we are more interested in this latent variable. We are more interested in the latent chance of voting than the observed behavior.

  • \(\widetilde{y}\) is the variable we wish we could measure.

  • The latent and observed variables are often distributed differently. Observed voting \(y=(0,1)\); latent variable \(Pr(y=1)\).

  • Linking \(X\) variables to the latent \(\widetilde{y}\) requires rescaling; this is often because changes in \(\widetilde{y}\) given \(X\) are nonlinear and in different units. In the example above, \(y\) is a binary realizing of voting, so is binomial. \(\widetilde{y}\) is the probability of voting, so is continuous, bounded between 0 and 1.

  • Besides, \(\widetilde{y}\) is unobserved - so we have to generate it from the model.

  • To link \(X\) with \(\widetilde{y}\) or \(y\), we assume their relationship follows some distribution; we call this the link distribution.


The big point

Our data are often limited, and not suitable for OLS. OLS asks us to transform data to make it meet the model assumptions (e.g., Generalized Least Squares); MLE builds the model based on the data.


Building an ML estimator

What do we need to build an ML estimator?

  • \(Y\) variable distribution.

  • use that distribution to write a function connecting \(x\beta\) (the linear estimates), to \(Y\).

  • use another another function to link the linear prediction, \(x\beta\) to \(\tilde{y}\), to map the linear prediction onto the latent variable we wish we could measure.


For example …

  • \(Y = (0,1)\), does an individual vote or not. This is binary, discrete, let’s say binomial.

  • \(\tilde{y}\) is the latent probability an individual votes. Wish we could measure this, use OLS.

  • Write a function based on \(Y\): \(\pi^y (1-\pi)^{(1-y)}\) , the binomial PDF (without the n-tuple).

  • \(\pi\) is given by some \(X\) variables we think affect voting - so \(\pi = x\beta\).

  • So substitute for \(\pi\)\((x\beta)^y (1-x\beta)^{(1-y)}\)

  • Now, link the linear prediction to \(\tilde{Y}\) - map \(x\beta\) onto the Pr(vote). How about using the Standard Normal CDF \(\Phi\)?

\(\Phi(x\beta)^y \Phi(1-x\beta)^{(1-y)}\) - this is now the core of the Probit.


Coming up next

That’s what we want to be able to do - so in the coming parts, we’ll:

  • review ML theory, using Gary King’s notation.

  • build a likelihood function for a continuous variable, and compare to OLS.

  • look at the technology of ML - how do we actually get estimates from a likelihood function?


Likelihood Theory

Probability and Likelihood differ from one another principally by how they treat the data and model in relation to one another.

Probability theory presumes some given model (or set of parameters) and seeks to estimate the data, given those parameters.


Likelihood

King (pp. 9, 14) puts it this way:

\[Y\sim f(y|\theta,\alpha)\]

and

\[\theta = G(X,\beta)\]

so our data, \(Y\) has a probability distribution given by parameters \(\theta, \alpha\), and \(\theta\) is a function of some variables, \(X\) and their parameters, \(\beta\).


All of this comprises the model in King’s lingo, so a basic probability statement appears as:

\[Pr(y|\mathcal{M}) \equiv Pr(\mathrm{data|model})\]

This is a conditional probability resting on two conditions:

  • The data are random and unknown.

  • The model is known.

Uh oh. The model is known? The data aren’t? This works in many probability settings. What is the probability of rolling 3 threes in 6 rolls of a six-sided die? What is the probability you draw an ace from a standard deck of cards? In cases like these, the model is known even before we observe events (data).


In cases like these, the model’s parameters are known and fixed. In our applications, the model and its parameters are the unknowns, but the events or data are known, fixed, and given.

Suppose I flip a coin to decide whether I roll a 20-sided die, or a 6-sided die. You do not observe any of this - I only tell you I rolled a 4 - which die did I roll? This is the problem we try to deal with in ML - what is the data generating process that most likely produced the observed data. Unlike the simple die roll, the model isn’t known. Instead, there are many possible models that could have produced the data.


Inverse Probability

Given these conditions, the more sensible model would be:

\[Pr(\mathcal{M}|y)\]

This is the inverse probability model - but it requires knowledge (or strong assumptions) regarding an important element of the (unknown) model, \(\theta\). Even Bayes can’t really do this.


Likelihood

Likelihood estimates the model, \(\mathcal{M}\) given the data, but assumes that \(\theta\) can take on different values, representing different (competing) hypothetical models.

The set of \(\theta\)’s are the competing models or data generating processes that could have produced the observed data set.


Keeping with King’s notation, the likelihood axiom is:

\[\mathcal{L}(\tilde{\theta}|y,\mathcal{M}*)\equiv L(\tilde{\theta}|y) \] \[=k(y)Pr(y|\tilde{\theta})\] \[\propto Pr(y|\tilde{\theta})\]

where \(\tilde{\theta}\) represents the hypothetical value of \(\theta\) (rather than its true value).

The term \(k(y)\) (known as the “constant of proportionality”) is a constant across all the hypothetical values of \(\tilde{\theta}\) (and thus drops out) but is the key to the likelihood axiom; it represents the functional form through which the data (\(y\)) shape \(\tilde{\theta}\) and thus allow us to estimate the likelihood as a measure of relative (instead of absolute) uncertainty; our uncertainty in this case is relative to the other possible functions of \(y\) and the hypothetical values of \(\tilde{\theta}\).

As King (p. 61) puts it , \(k(y)\) “measures the relative likelihood of a specified hypothetical model \(\tilde{\beta}\) producing the data we observed.”

It turns out that the likelihood of observing the data is proportional to the probability of observing the data.


Take Away Points

  • We want to know the probability (the model) of observing some data; if we find the model parameters with the highest likelihood of generating the observed data, we also know the probability because the two are proportional.

  • Likelihoods are always negative; they do not have a scale or specific meaning; they do not transform into probabilities or anything familiar.

  • We find the parameter values that produce the largest likelihood values; those parameters that maximize the likelihood then can be translated into sensible quantities - they can be mapped onto \(\tilde{y}\), giving us a measure of the variable we wish we had.

  • In OLS, we compute parameter estimates that minimize the sum (squared) distance of all the observed points to the predicted points - we minimize the errors in this fashion.

  • The technology of MLE is trial and error - choose some values for \(\tilde{\theta}\), compute the likelihood, then repeat. Compare all the likelihoods - the values of \(\tilde{\theta}\) that produced the highest likelihood value are the ones that most likely generated the observed data.


Likelihood: a linear model

Let’s start with some data, and build a likelihood model. Suppose you have data, \(y\), where:

\[y\sim N(\mu,\sigma^{2})\] \[E(y)=\mu\]

\[var(y)=\sigma^{2}\]

\(y\) is distributed normally with appropriate distribution parameters that measure the moments of \(y\).

Given these data, we want to know what values of \(\mu, \sigma^2\) most likely produced the data.


Writing the Likelihood

Since we’ve determined \(y\) is normal, write the Normal PDF.

\[Pr(Y=y_{i})=\frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{\left[\frac{-(y_{i}-\mu_{i})^{2}}{2\sigma^{2}}\right]}\]

We’re interested in the joint probability of the observations, the probability the data result from a particular Data Generating Process. Assuming the observations in the data are independent of one another, the joint density is equal to the product of the marginal probabilities:

\[Pr(A~ \mathrm{and}~ B)=Pr(A)\cdot Pr(B)\]

so the joint probability of \(y\), written in terms of the likelihood, is given by

\[\mathcal{L} (\mu, \sigma^{2}| y )= \prod\limits_{i=1}^{n} \frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{\left[\frac{-(y_{i}-\mu_{i})^{2}}{2\sigma^{2}}\right]}\]

Adding is easier than multiplying; since we can transform the likelihood function by any monotonic form, we can take its natural log to replace the products with summations:

\[\ln \mathcal{L} (\mu, \sigma^{2}|y) = \ln \prod\limits_{i=1}^{n} \frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{\left[\frac{-(y_{i}-\mu_{i})^{2}}{2\sigma^{2}}\right]}\]

\[= \sum \ln \left[\frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{\left[\frac{-(y_{i}-\mu_{i})^{2}}{2\sigma^{2}}\right]} \right]\]

\[=\sum\left( -\frac{1}{2}(\ln(2\pi))-\frac{1}{2}(\ln(\sigma^{2}))-\frac{1}{2\sigma^{2}}\left[\sum\limits_{i=1}^{n}(y_{i}-\mu)^{2}\right] \right)\]


The Linear model

It should be pretty evident this is the linear model.

  • we started with data that looked normal; continuous, unbounded, infinitely differentiable.

  • assuming normality, we wrote a LLF in terms of the distribution parameters \(\mu, \sigma^2\).

  • but we want \(\mu\) itself to vary with some \(X\) variables; \(y\) isn’t just characterized by a grand mean, but by a set of of conditional means given by \(X\).

  • so we need to parameterize the model - write the distribution parameter as a function of some variables.

  • generally, this is to declare \(\theta = F(x\beta)\).

  • in the linear/normal case, to declare \(\mu=F(x\beta)\). Of course, \(F\) is linear, we just write \(\mu=x\beta\).

  • in the LLF, substitute \(x\beta\) for \(\mu\), and now we’re taking the difference between \(y\) and \(x\beta\), squaring those differences, and weighting them by the variance.


To put it all together \(\ldots\)

\[\ln \mathcal{L}(\mu, \sigma^{2}|y) = \ln \prod\limits_{i=1}^{n} \frac{1}{\sqrt{2 \pi \sigma^{2}}} exp \left[\frac{-(y-x\beta)^{2}}{2\sigma^{2}}\right] \] \[= \sum \ln \left\{\frac{1}{\sqrt{2 \pi \sigma^{2}}} exp \left[\frac{-(y-x\beta)^{2}}{2\sigma^{2}}\right]\right\}\]

\[= \sum\left(-\frac{1}{2}(\ln(2\pi)) -\frac{1}{2}(\ln(\sigma^{2})) -\frac{1}{2\sigma^{2}}\left[\sum\limits_{i=1}^{n}(y-x\beta)^{2}\right] \right)\]

Does this look familiar? A lot like deriving OLS, eh?


Linear regression in ML

This is the Normal (linear) log-likelihood function. It presumes some data, \(\mathbf{y}\) and some unknowns \(\mathbf{\beta, \sigma^2}\). You should note the kernel of the function is the sum of the squared differences of \(y\) and \(x\beta\).

\[\ln\mathcal{L} =\sum\left(-\frac{1}{2}(\ln(2\pi)) -\frac{1}{2}(\ln(\sigma^{2})) -\frac{1}{2\sigma^{2}}\color{red}{\left[\sum\limits_{i=1}^{n}(y-x\beta)^{2}\right]} \right)\]

It turns out if we take the derivative of this function with respect to \(\beta\) and \(\sigma^2\), the result is the ML estimator, and the OLS estimator. They’re the same.

Linearity

This model is linear - our estimates, \(x\beta\) map directly onto \(y\) - there is no latent variable, \(\tilde{y}\) to map onto - so \(x\beta = \widehat{y}\).

Note that this is not because \(y\) is normal. Rather, the fact that \(y\) is continuous and contains a lot of information and is not limited makes it more likely \(y\) is normal.

Though ML is most often used in cases where \(y\) is limited (so OLS is inappropriate), it’s important to see that we can estimate the linear model using either technology (OLS, ML) and get the same estimates.

Building a Model, Given Some Data

Now, we have enough tools to go about building a model from scratch.

Table 1 summarizes the number of soldiers who died from being kicked by a mule in the Prussian army in the late 19th century; these are the numbers of deaths per army unit. So 144 units had zero deaths, etc. How can we go about building a statistical model of mule kick deaths?

How would you begin thinking about the DGP and a statistical model? Let’s look at a summary of the data.

Table 1: Summary of Deaths by Horse or Mule Kick, Prussian Army.1

[!ht]

# of Deaths Frequency
0 144
1 91
2 32
3 11
4 2
N=280

See von Bortkiewicz, L. (1898). Das Gesetz der Kleinen Zahlen. Leipzig: Teubner.

code
library(vcd)
data("VonBort")
horse<-data.frame(VonBort)
horsesum <- horse %>% group_by(deaths) %>% count(deaths) %>% ungroup()

bucolors<-list("#005A43","#6CC24A", "#A7DA92", "#BDBEBD", "#000000" )

hchart(horsesum, "column", hcaes(x=deaths, y=n)) %>% 
  hc_title(text="Deaths by Horse or Mule Kick") %>% 
  hc_subtitle(text="Prussian Army, 19th Century") %>% 
  hc_yAxis(title="Frequency") %>% 
  hc_colors("#005A43") %>%
  hc_xAxis(title="Deaths") %>% 
  hc_legend(enabled=FALSE) %>% 
  hc_tooltip(pointFormat = "Deaths: {point.x}<br>Frequency: {point.y}") %>% 
  hc_credits(enabled=TRUE, text="Source: von Bortkiewicz, L. (1898). Das Gesetz der Kleinen Zahlen. Leipzig: Teubner.") 

How would you characterize the variable measuring deaths by mule kick?

  • Variable measures events.
  • Events are discrete, not continuous.
  • Are events correlated or independent?
  • Variable is bounded; cannot be below zero.

So we need to think about two different things here - the distribution of \(y\) based on its observed distribution, and the link between the \(X\) variables and \(\widetilde{y}\), the latent quantity of interest.

  • What distribution might describe the frequency of mule kick deaths?
  • Needs to be discrete.
  • Needs to characterize rare events - at most we see about four per period, so relatively rare.
  • Needs to have a lower bound at zero (since we can’t observe negative numbers of deaths), and upper bound at \(+\infty\)

\[Pr(Y=y_{i})=\frac{e^{-\lambda}\lambda^{y_{i}}}{y_{i}!}\]

The Poisson Distribution

The Poisson distribution is a discrete distribution that characterizes the number of events that occur in a fixed interval of time (or sometimes, space). Below are 3 Poisson distributions with means of .5, 1.5, and 2.5.

code
#using a large sample, n=1000, simulate and plot the poisson distribution for mean values of .5, 1.5, and 2.5

# Load required libraries
library(ggplot2)
library(tidyr)
library(dplyr)

# Set seed for reproducibility
set.seed(8675309)

# Simulate data
n <- 10000  # n for each distribution
lambdas <- c(0.5, 1.5, 2.5)  # Means 

df <- data.frame(
  lambda_0.5 = rpois(n, lambda = 0.5),
  lambda_1.5 = rpois(n, lambda = 1.5),
  lambda_2.5 = rpois(n, lambda = 2.5)
)

# Reshape the data for ggplot
df_long <- df %>%
  pivot_longer(cols = everything(), 
               names_to = "distribution", 
               values_to = "value") %>%
  mutate(distribution = factor(distribution, 
                               levels = c("lambda_0.5", "lambda_1.5", "lambda_2.5"),
                               labels = c("λ = 0.5", "λ = 1.5", "λ = 2.5")))

# plot 
ggplot(df_long, aes(x = value, fill = distribution)) +
  geom_density(position = "identity", alpha = 0.5, adjust=4) +
  facet_wrap(~ distribution, ncol = 1, scales = "free_y") +
  scale_fill_manual(values = c("#000000", "#6CC24A", "#005A43")) +
  labs(title = "Simulated Poisson Distributions",
       x = "Value",
       y = "Count",
       fill = "Distribution") +
  theme_minimal() +
  theme(legend.position = "none")

Thinking in terms of the data, let’s write a likelihood function, the joint probability for all \(i\) observations in the sample, \(n\):

\[ \mathcal{L}(\lambda)= \prod_{i=1}^{n} \left[\frac{e^{-\lambda}\lambda^{y_i}}{y_i!} \right] \]

Take the natural log of the likelihood function:

\[ \ln \mathcal{L}(\lambda)= \ln \left\{\prod_{i=1}^{n} \left[\frac{e^{-\lambda}\lambda^{y_i}}{y_i!} \right] \right\} \]

\[\ln \mathcal{L}(\lambda)= \sum_{i=1}^{n} \left[-\lambda + y_i \ln(\lambda) - \ln(y_i!) \right] \]

What about the \(X\) variables? Parameterize the model with respect to those variables such that they influence the mean, \(\lambda\). So let’s make \(\lambda\) a function of the \(X\) variables and their effects, \(\beta\), using the exponential distribution as the link function:

\[ E[Y|X]=\lambda =exp(X\beta) \]

The exponential ensures we won’t have negative predictions.

Putting all this together we have:

\[ \ln \mathcal{L}(\lambda)= \sum_{i=1}^{n} \left[-e^{X\beta} + y_i \ln(X\beta) - \ln(y_i!) \right] \]

Estimation Technology

Recall that the technology of OLS is to assume a normally distributed error term, minimize the sum of those squared errors analytically using calculus.

Estimation Technology: MLE

The technology of ML is to maximize the LLF with respect to \(\beta\). We can do this in a couple of different ways:

  • analytic methods - solve calculus. Some/many models do not have analytical or closed form solutions.

  • numerical methods - use an algorithm to estimate starting values for \(\theta\), then hill climb until the first derivative is zero, and the second derivative is negative. This is iterative, trying values, looking at the derivatives. This is what nearly all ML estimation uses - there are different algorithms for doing this. The most commonly used is the Newton Raphson method - it’s illustrated in detail in the maximization slides

Analytic Methods

With some functions, we can solve for the unknowns directly. Let’s return to the Poisson log-likelihood function and consider data on the number of civil wars in Africa over a ten year period, and the data are as follows:

\(y\) = {5 0 1 1 0 3 2 3 4 1}

The log-likelihood function is:

\[\begin{aligned} \ln \mathcal{L}(\lambda|Y) =\ln \left[ \prod\limits_{i=1}^{N} \frac{e^{-\lambda}\lambda^{y_{i}}}{y_{i}!}\right]\nonumber \\ \nonumber \\ =-N \lambda+ \sum(y_{i}) \ln(\lambda) - \sum(\ln(y_{i}!)) \nonumber \end{aligned}\]

\(\lambda\) is the unknown we want to solve for. Taking the derivative with respect to \(\lambda\) and setting equal to zero:

\[\begin{aligned} \frac{\partial \ln \mathcal{L}}{\partial \lambda}=-N \lambda+ \sum(y_{i}) \ln(\lambda) - \sum(\ln(y_{i}!)) \nonumber \\ \nonumber \\ 0=-N + \frac{\sum y_{i}}{\lambda} \nonumber\\ \nonumber \\ \color{red}{\widehat{\lambda}= \frac{\sum y_{i}}{N}} \nonumber \end{aligned}\]

This is just the sample mean of course - let’s plug in our data and solve for \(\lambda\):

\[\widehat{\lambda}= \frac{20}{10} \] So the value of \(\lambda\) the maximizes the function is 2. This is a trivial example in the sense that applications with \(x\) variables are sufficiently complicated that analytical methods are not usually possible, so we turn to numerical methods.

Numerical Methods

Numerical methods are computationally intensive ways to plug in possible parameter values, generate a log likelihood, and then use calculus to evaluate whether the that value is a maximum. We use numerical methods when no analytic or “closed form” solution exists which is essentially all the time.

Do this by evaluating:

  • the first derivative of the LLF - by finding the point on the function where a tangent line has a slope equal to zero, we know we’ve found an inflection point.
  • the second derivative of the LLF - if the rate of change in the function at the very next point is increasing, it’s a minimum; decreasing, it’s a maximum.
  • the Hessian matrix - the matrix of second derivatives - tells us the curvature of the LLF, or the rate of change.

Suppose that we have the event count data reported above representing civil wars in Africa, and that we want to compute the likelihood of \(\lambda|Y\). We can compute the likelihood using numerical methods; one specific technique is a grid search procedure. Just as we might try different values for \(x\) when graphing a function, \(f(x)\) in algebra, we will insert possible values for \(\lambda\) into the log-likelihood function in such a way that we can identify an apparent maximum (a value for \(\lambda\) for which the log-likelihood is at its largest compared to contiguous values of \(\lambda\)). Put another way, we take a guess at the value of \(\lambda\), compute the log-likelihood, and take another guess at \(\lambda\), compute the log-likelihood and compare the two estimates of the likelihood; we repeat this process until a pattern emerges such that we can discern a maximum value.

The log-likelihood function for the poisson distributed data on civil wars is

\[ \ln \mathcal{L}(\lambda|Y)= \ln \left[\frac{e^{-10\lambda}\lambda^{20}}{207360}\right] \nonumber \\ \nonumber \\ = -10 \lambda+ 20 \ln(\lambda) - \ln(207360) \nonumber \]

Suppose we make some guesses regarding the value of \(\lambda\), plug them into the function and compare the resulting values of the log-likelihood - take a look at the code chunk below:

code
#iterate over lambda, create data frame of lambda and log-likelihood
lambda <- seq(0.1, 3.5, by=0.1)
llf <- NULL
for (i in 1:length(lambda)){
  L <- -10*lambda[i] + 20*log(lambda[i]) - log(207360)
  llf <- data.frame(rbind(llf, c(lambda=lambda[i], ll=L)))
}

#highchart with reference line at maximum value of the log-likelihood
highchart() %>% 
  hc_add_series(llf, "line", hcaes(x=lambda, y=ll)) %>% 
  hc_title(text="Log-Likelihood Estimates") %>% 
  hc_subtitle(text="Civil Wars in Africa") %>% 
 hc_xAxis(title = list(text = "Lambda"), plotLines = list(list(value = 2, color="red"))) %>%
  hc_yAxis(title = list(text = "log-likelihood"), plotLines = list(list(value = max(llf$ll), color="red"))) %>%
  hc_tooltip(pointFormat = "Lambda: {point.x}<br>Log-Likelihood: {point.y}") %>% 
  hc_colors("#005A43") 

We can see that the largest value of the likelihood is where \(\lambda\) = 2 - that’s the value that maximizes the likelihood function. And not surprisingly, notice that we have arrived at the same solution we produced in the analytic example above. This is another trivial example insofar as grid search methods are usually not sufficient for solving multivariate problems (nor for computing the variance-covariance matrix).

Numerical method intuition

  • Choose starting values of \(\beta\) (sometimes from OLS) to estimate the log-likelihood.

  • Take the derivative of the log-likelihood with respect to the parameters to find the gradient}. The gradient (or the gradient matrix, a \(kxk\) matrix) tells us the direction of the slope of a line tangent to the curve at the point of the log-likelihood estimate.

  • If the gradient is positive (if the matrix is positive definite), then \(ln \mathcal{L}\) is increasing in \(\beta\) - the slope is increasing, so increase our estimate of \(\beta\) and try again.

  • If the gradient is negative (if the matrix is negative definite), the \(ln \mathcal{L}\) is decreasing in \(\beta\) - the slope is decreasing, so we’ve passed the maximum; choose a smaller value for \(\beta\) and try again.

  • As the log-likelihood approaches the maximum, the gradient approaches zero - the slope of the line tangent to the curve at the point of the log-likelihood estimate is approaching zero, indicating we’re reaching the maximum of the function. Stop the search and evaluate the estimates of \(\beta\) that produced the zero gradient.

  • Throughout this process, we need to evaluate the second derivatives in order to figure out the rate at which the slope is changing; this helps us tell how close or far we are from the maximum. The second derivative describes the curvature of the LLF, or the rate of change.

  • The matrix of second derivatives (the Hessian matrix) or its approximation also provide the source of our estimates of the variance, and thus the standard errors.


The first derivative tells us the direction in which the function is changing. This is obviously important since we’re trying to find the maximum.

Think of this as trying to figure out when you’re exactly at the top of a hill. The slope (the grade, the gradient) is positive while you’re climbing to the top, it’s zero at the top, and it’s negative on the way down the other side.

But is the hill flat or steep? If it’s flat, then the change in the slope between point A and point B is likely to be very small - this, of course, can make it difficult to know exactly when we’re at the top (the maximum). On the other hand, if the hill is very steep, the change in the slope between two points is pretty substantial. Put another way, the rate of change in the slope is larger (faster) the steeper the slope; it’s smaller (slower) the flatter the slope.

This matters to maximization because the second derivatives tell us how big (or small) a step we should take up the hill as we try to find the top. Suppose that the function is very flat; as indicated above, the change in the slope between two points would be small, so we can take larger steps in order to try to find the maximum. The second derivatives would tell us that the rate of change is very small, so we should take larger steps.

The software performing the estimation will choose the next value of \(\beta\) a bit further away from the last value it tried. On the other hand, if the second derivatives are large so the rate of change is fast, we want to take relatively small steps so we don’t step right over the maximum. In any case, that’s the intuition for why we need to know the matrix of second derivatives.

Variance-Covariance matrix

Estimating the second derivatives can be a real nightmare in estimation, but it’s important not only for finding the maximum of the function (and therefore in estimating the \(\beta\)s), but for computing the variance-covariance matrix as well. Here are our options:

  • Find the Hessian. The Hessian is a \(kxk\) matrix of the second derivatives of the log-likelihood function with respect to \(\beta\), where the second derivatives are on the main diagonal. Commonly estimated using the Newton-Raphson algorithm.
  • Find the information matrix. This is the negative of the expected value of the Hessian matrix, computed using the method of scoring.
  • Outer product approximation, where we sum the squares of the first derivatives (thus avoiding the second derivatives all together). This is computed using the Berndt, Hall, Hall, and Hausman} or BHHH algorithm.

Properties of MLEs

  • Consistency - though maximum likelihood estimators are not necessarily unbiased estimators, they are consistent asymptotically. So as the sample size increases, the estimates increasingly resemble the actual population parameters. As a result, MLEs are good large sample estimators, though defining large is not at all straightforward.

  • Asymptotic Normality - the MLEs (the parameters) are themselves distributed according to the standard multivariate normal, no matter what distributional assumptions you make in your model or about your data. This is great because, since MLEs are always normally distributed, we can always describe them using z-scores.

  • Asymptotic Efficiency - the MLE has the smallest asymptotic variance of any estimators that are also consistent and asymptotically normal. See King, p. 80.

  • Invariance - if \(\hat{\Theta_{ML}}\) is the vector of ML estimates, and \(g(\Theta)\) is a continuous function of \(\Theta\), the \(g(\hat{\Theta_{ML}})\) is a consistent estimator of \(g(\Theta)\). So if we transform variables, we can retransform the estimates without losing interpretive ability.

Back to top

Footnotes

  1. See von Bortkiewicz, Ladislaus (1898). Das Gesetz der Kleinen Zahlen. Leipzig: Teubner.↩︎